[SOUND]
In general, we can use the empirical count
of events in the observed data
to estimate the probabilities.
And a commonly used technique is
called a maximum likelihood estimate,
where we simply normalize
the observe accounts.
So if we do that, we can see, we can
compute these probabilities as follows.
For estimating the probability that
we see a water current in a segment,
we simply normalize the count of
segments that contain this word.
So let's first take
a look at the data here.
On the right side, you see a list of some,
hypothesizes the data.
These are segments.
And in some segments you see both words
occur, they are indicated as ones for
both columns.
In some other cases only one will occur,
so only that column has one and
the other column has zero.
And in all, of course, in some other
cases none of the words occur,
so they are both zeros.
And for estimating these probabilities, we
simply need to collect the three counts.
So the three counts are first,
the count of W1.
And that's the total number of
segments that contain word W1.
It's just as the ones in the column of W1.
We can count how many
ones we have seen there.
The segment count is for word 2, and we
just count the ones in the second column.
And these will give us the total
number of segments that contain W2.
The third count is when both words occur.
So this time, we're going to count
the sentence where both columns have ones.
And then, so this would give us
the total number of segments
where we have seen both W1 and W2.
Once we have these counts,
we can just normalize these counts by N,
which is the total number of segments, and
this will give us the probabilities that
we need to compute original information.
Now, there is a small problem,
when we have zero counts sometimes.
And in this case, we don't want a zero
probability because our data may be
a small sample and in general, we would
believe that it's potentially possible for
a [INAUDIBLE] to avoid any context.
So, to address this problem,
we can use a technique called smoothing.
And that's basically to add some
small constant to these counts,
and so that we don't get
the zero probability in any case.
Now, the best way to understand smoothing
is imagine that we actually observed more
data than we actually have, because we'll
pretend we observed some pseudo-segments.
I illustrated on the top,
on the right side on the slide.
And these pseudo-segments would
contribute additional counts
of these words so
that no event will have zero probability.
Now, in particular we introduce
the four pseudo-segments.
Each is weighted at one quarter.
And these represent the four different
combinations of occurrences of this word.
So now each event,
each combination will have
at least one count or at least a non-zero
count from this pseudo-segment.
So, in the actual segments
that we'll observe,
it's okay if we haven't observed
all of the combinations.
So more specifically, you can see
the 0.5 here after it comes from the two
ones in the two pseudo-segments,
because each is weighted at one quarter.
We add them up, we get 0.5.
And similar to this,
0.05 comes from one single
pseudo-segment that indicates
the two words occur together.
And of course in the denominator we add
the total number of pseudo-segments that
we add, in this case,
we added a four pseudo-segments.
Each is weighed at one quarter so
the total of the sum is, after the one.
So, that's why in the denominator
you'll see a one there.
So, this basically concludes
the discussion of how to compute a these
four syntagmatic relation discoveries.
Now, so to summarize,
syntagmatic relation can generally
be discovered by measuring correlations
between occurrences of two words.
We've introduced the three
concepts from information theory.
Entropy, which measures the uncertainty
of a random variable X.
Conditional entropy, which measures
the entropy of X given we know Y.
And mutual information of X and Y,
which matches the entropy reduction of X
due to knowing Y, or
entropy reduction of Y due to knowing X.
They are the same.
So these three concepts are actually very
useful for other applications as well.
That's why we spent some time
to explain this in detail.
But in particular,
they are also very useful for
discovering syntagmatic relations.
In particular,
mutual information is a principal way for
discovering such a relation.
It allows us to have values
computed on different pairs of
words that are comparable and
so we can rank these pairs and
discover the strongest syntagmatic
from a collection of documents.
Now, note that there is some relation
between syntagmatic relation discovery and
[INAUDIBLE] relation discovery.
So we already discussed the possibility
of using BM25 to achieve waiting for
terms in the context to potentially
also suggest the candidates
that have syntagmatic relations
with the candidate word.
But here, once we use mutual information
to discover syntagmatic relations,
we can also represent the context with
this mutual information as weights.
So this would give us
another way to represent
the context of a word, like a cat.
And if we do the same for all the words,
then we can cluster these words or
compare the similarity between these
words based on their context similarity.
So this provides yet
another way to do term weighting for
paradigmatic relation discovery.
And so to summarize this whole part
about word association mining.
We introduce two basic associations,
called a paradigmatic and
a syntagmatic relations.
These are fairly general, they apply
to any items in any language, so
the units don't have to be words,
they can be phrases or entities.
We introduced multiple statistical
approaches for discovering them,
mainly showing that pure
statistical approaches are visible,
are variable for
discovering both kind of relations.
And they can be combined to
perform joint analysis, as well.
These approaches can be applied
to any text with no human effort,
mostly because they are based
on counting of words, yet
they can actually discover
interesting relations of words.
We can also use different ways with
defining context and segment, and
this would lead us to some interesting
variations of applications.
For example, the context can be very
narrow like a few words, around a word, or
a sentence, or maybe paragraphs,
as using differing contexts would
allows to discover different flavors
of paradigmatical relations.
And similarly,
counting co-occurrences using let's say,
visual information to discover
syntagmatical relations.
We also have to define the segment, and
the segment can be defined as a narrow
text window or a longer text article.
And this would give us different
kinds of associations.
These discovery associations can
support many other applications,
in both information retrieval and
text and data mining.
So here are some recommended readings,
if you want to know more about the topic.
The first is a book with
a chapter on collocations,
which is quite relevant to
the topic of these lectures.
The second is an article
about using various
statistical measures to
discover lexical atoms.
Those are phrases that
are non-compositional.
For example,
hot dog is not really a dog that's hot,
blue chip is not a chip that's blue.
And the paper has a discussion about some
techniques for discovering such phrases.
The third one is a new paper on a unified
way to discover both paradigmatical
relations and a syntagmatical relations,
using random works on word graphs.
[SOUND]

